Background
On October 7, 2023, Hamas and other Palestinian armed groups orchestrated a deadly attack on Israel. The horrific attack killed 1,200 people, with over 200 hostages seized and over 100 still unaccounted for. Israeli forces responded with airstrikes and ground operations.
- The ongoing conflict has devastated the civilian population of Gaza. Seventy-five percent of Gaza's population has been displaced, most multiple times, and the entire population is in need of humanitarian assistance. The ongoing conflict, bombardment, and blockade have led to catastrophic humanitarian suffering for more than 2 million Palestinians, half of them children, who are now without clean water, food, and vital medical services.
- Since the beginning of the war on October 7, 2023, Airwars has monitored open-source civilian harm incidents in Gaza. An incident is defined as an explosive-weapon strike or ground battle operation that produced civilian casualties or harm. Victims are assumed to be civilians unless there is information establishing their militant status. Data are derived from tweets or Facebook posts translated from Arabic to English; when names are provided, each named person is counted as a casualty, and these names are corroborated against the Hamas Ministry of Health (MoH) lists when they are released. All data are presented on the Airwars website, and each incident has its own web page.
Aim of Study
Given the comprehensive nature of Airwars’ data, this study aims to explore the potential of open-source information to identify patterns in the targeting of Palestinian civilians in Gaza. We scraped and processed all available incident reports from the Airwars website and stored them in a publicly accessible SQLite database. To enrich the dataset, we reverse geocoded incident coordinates via the Nominatim API to determine the type of location in which each attack took place (e.g., school, mosque, park). Additionally, to understand the general emotional tone of these assessments, we used sentiment analysis to classify the text into various emotional states. Finally, we hierarchically clustered incidents based on casualty counts and emotion scores to identify patterns in targeting and geographic associations with child and women casualties in Gaza.
Purpose of Document
This document outlines all of the pieces of this project so that others can collaborate. Some of the packages used in the code, such as text, are easier to install by following the instructions on the package website. Otherwise, we have exported our Anaconda environment to a yml file here, to be used as a reference for packages and libraries. Some packages cannot be built from conda and must instead be installed from CRAN. In the near future, we will provide a full bash script that installs this environment using both the Anaconda repository and CRAN.
Computing Resources:
This project was created on a Lenovo Legion 7i laptop with an Intel i9-14900HX CPU, 64 GB of DDR5 RAM, and an NVIDIA RTX 4070 GPU with 8 GB of GDDR6 memory. The operating system initially used was Ubuntu 24.10. However, when we attempted to configure our GPU to process text data for the language model, we found that this version of Ubuntu ships the newest kernel, which updates the NVIDIA and CUDA drivers to versions that are not compatible with the TensorFlow and PyTorch builds needed for the text package. We moved to Ubuntu 24.04 LTS within WSL2 on Windows 11 and were able to configure the GPU there. Because Airwars also goes back and corrects archived incidents, it is easiest to rerun the full process on all available incident records when needed, and the GPU cuts down the processing time: modeling the text data on the GPU reduced the run from 2.5 hours to about 20 minutes.
We use r-base 4.4.1 from Anaconda and RStudio 2024.04.02. We have also used these same packages on rstudio-server via WSL2 but prefer to isolate the computing environment.
We store all of our data in a SQLite database that can also be found in our GitHub repository.
Structure of SQLite database
The scraped Airwars data resulted in over 800 unique events, stored in two tables of a SQLite database; a third table holds the MoH daily counts.
Table 1 contains incident metadata (e.g., unique id, incident date, web-page URL).
Table 2 stores the specific incident information such as the number of deaths, breakdown of deaths (children, adults), type of attack, and cause of death, incident coordinates and results from Nominatim, and sentiment scores for seven emotional states.
Table 3 contains the Hamas Ministry of Health (MoH) daily casualties.
The first two tables relate to each other through the unique incident identification numbers provided by Airwars. We relate the MoH table to the Airwars tables by aggregating up to the date.
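As a sketch of how these relationships work, the joins can be expressed with dplyr. The toy rows below stand in for the real tables, and column names such as total_deaths and moh_killed are illustrative assumptions, not the database's actual fields:

```r
library(dplyr)

# toy stand-ins for the three database tables (values are made up)
meta <- tibble(Incident_id = c("ispt0019a", "ispt0019"),
               Incident_Date = as.Date(c("2023-10-07", "2023-10-07")))
incidents <- tibble(Incident_id = c("ispt0019a", "ispt0019"),
                    total_deaths = c(3, 5))
moh <- tibble(Incident_Date = as.Date("2023-10-07"), moh_killed = 250)

# join incident detail to metadata on the Airwars incident id, then
# aggregate to the day so the MoH table can be joined on date
combined <- meta |>
  left_join(incidents, by = "Incident_id") |>
  summarise(airwars_deaths = sum(total_deaths), .by = Incident_Date) |>
  left_join(moh, by = "Incident_Date")
combined
```

The date-level aggregation is what makes the daily MoH counts comparable to the incident-level Airwars counts.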
# connect to database
mydb <- dbConnect(
RSQLite::SQLite(),
"~/repos/airwars_scraping_project/database/airwars_db.sqlite")
dbListTables(mydb) # print tables in database
[1] "airwars_incidents" "airwars_meta"      "daily_casualties"
Scraping Airwars Civilian Casualty Incidents
- The image below is an example of the Airwars incident metadata, presented as baseball-card-style summaries. This information appears on a single web page, and our workflow starts at this junction. We read the main Airwars page that houses this information and scrape only the Incident Date and Incident ID, which we use to build the incident-specific URLs that we later scrape for content.
- All of the code for scraping and processing these data is found in our GitHub repository under code/scrape_process_incidence. The code has been optimized and takes about 30 minutes on our laptop (32 GB of RAM is sufficient) with a fast internet connection.
- Here we only explain how we pre-processed the data as it relates to preparing for analysis.
- For much of the scraping we used SelectorGadget to get the XPath and passed it to the rvest package.
- Metadata table: We scrape the main Airwars website and parse the information we need to build a table containing each incident’s web URL (over 800 URLs), as seen in the example below.
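As an illustration of how the URL-building step works, the incident pages follow the pattern visible in the links of the metadata table, so a small helper can rebuild each URL from the scraped Incident ID and date. The helper name is ours, a sketch of the idea rather than the exact code in the repository:

```r
# rebuild an incident's web URL from its id and date, matching the
# pattern of the links in the metadata table (id-month-day-year)
build_incident_url <- function(incident_id, incident_date) {
  d <- as.Date(incident_date)
  slug <- paste(tolower(month.name[as.integer(format(d, "%m"))]),
                as.integer(format(d, "%d")),
                format(d, "%Y"),
                sep = "-")
  paste0("https://airwars.org/civilian-casualties/", incident_id, "-", slug, "/")
}

build_incident_url("ispt0019a", "2023-10-07")
#> [1] "https://airwars.org/civilian-casualties/ispt0019a-october-7-2023/"
```

Using base R's month.name keeps the slug in English regardless of the system locale.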
# read in data tables
airwars_meta <- tbl(mydb, "airwars_meta") |>
as_tibble() |>
# convert Incident_Date to date format
mutate(Incident_Date = as_date(Incident_Date)) |>
arrange(Incident_Date)
airwars_meta |> head() |> kable()

| Incident_Date | Incident_id | link |
|---|---|---|
| 2023-10-07 | ispt0019a | https://airwars.org/civilian-casualties/ispt0019a-october-7-2023/ |
| 2023-10-07 | ispt0019 | https://airwars.org/civilian-casualties/ispt0019-october-7-2023/ |
| 2023-10-07 | ispt0017 | https://airwars.org/civilian-casualties/ispt0017-october-7-2023/ |
| 2023-10-07 | ispt0011 | https://airwars.org/civilian-casualties/ispt0011-october-7-2023/ |
| 2023-10-07 | ispt0010 | https://airwars.org/civilian-casualties/ispt0010-october-7-2023/ |
| 2023-10-07 | ispt0003 | https://airwars.org/civilian-casualties/ispt0003-october-7-2023/ |
- Using the web URLs built into the metadata table, we loop through and scrape each incident page to parse its assessment, as shown in the example below.
- Each incident page contains an assessment section detailing what transpired during the incident, who was known to be involved, and the victims. We use this text later to derive emotion scores.
- Incidents table: Our final table contains the fields that Airwars populates for each incident. Besides parsing this information we also had to process the data: some fields contain ranges of deaths (e.g., 3-5) or labelled counts (e.g., 1 child, 3 women, 1 man), which we split into their own columns. This allows us to estimate how many children and women have been reported as civilian casualties. Our data contain 24 variables across the 804 incidents reported by Airwars.
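A minimal sketch of that string processing with stringr; the helper names (parse_range, count_group) and output column names are ours for illustration:

```r
library(stringr)

# split a death-count range like "3-5" into numeric low/high columns;
# a single value such as "4" yields low == high
parse_range <- function(x) {
  parts <- str_split_fixed(x, "-", 2)
  parts[parts[, 2] == "", 2] <- parts[parts[, 2] == "", 1]
  data.frame(killed_low  = as.numeric(parts[, 1]),
             killed_high = as.numeric(parts[, 2]))
}

# pull a labelled count (e.g. "3 women") out of "1 child, 3 women, 1 man"
count_group <- function(x, label) {
  n <- as.numeric(str_match(x, paste0("(\\d+)\\s+", label))[, 2])
  ifelse(is.na(n), 0, n)
}

parse_range(c("3-5", "4"))
count_group("1 child, 3 women, 1 man", "wom(a|e)n")
#> [1] 3
```

Both helpers are vectorized, so they can be applied directly across the scraped columns with mutate().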
airwars_incidents <- tbl(mydb, "airwars_incidents") |>
as_tibble()
airwars_incidents |>
head() |>
select(-assessment:-surprise) |>
DT::datatable()
MoH Daily Casualties
The Palestine Datasets project publishes daily Gaza casualty counts that it takes from the Hamas MoH; however, these counts do not distinguish between civilians and militants, so their numbers should be higher than what we derive from Airwars.1
- We use the Palestine Datasets API (https://data.techforpalestine.org/api/v2/casualties_daily.json) and parse the JSON, saving it into our database after a bit of data wrangling.
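A sketch of that step with jsonlite. The field names here (report_date, killed_cum) are assumptions about the endpoint's schema; verify them against the API documentation before relying on this:

```r
library(jsonlite)
library(dplyr)

# the endpoint returns a JSON array with one record per day
url <- "https://data.techforpalestine.org/api/v2/casualties_daily.json"

# wrangle a JSON payload of daily records into the shape of our
# daily_casualties table (assumed field names: report_date, killed_cum)
wrangle_daily <- function(json_txt) {
  fromJSON(json_txt) |>
    as_tibble() |>
    transmute(Incident_Date = as.Date(report_date), killed_cum)
}

# live call: daily <- wrangle_daily(paste(readLines(url), collapse = ""))
# offline example with the same shape (values are made up):
wrangle_daily('[{"report_date":"2023-10-07","killed_cum":232}]')
```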
tbl(mydb, "daily_casualties") |>
as_tibble() |>
mutate(Incident_Date = lubridate::as_date(Incident_Date)) |>
head() |>
DT::datatable()
Enriching Data with Reverse Geocoding
When possible, Airwars includes the coordinates of where an incident took place. Although this information is contained within the assessment, Airwars standardizes its location data under a “Geolocation notes” heading, from which we parse the latitude and longitude for geographic plotting. Of the 804 incidents, about 65% contain geographic coordinates.
- We used the Nominatim OpenStreetMap API to reverse geocode these coordinates and return the type of location targeted, for incidents that contained coordinates. We also save a set of bounding-box coordinates.
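A hedged sketch of the reverse-geocoding request. The query parameters follow Nominatim's documented /reverse endpoint, but verify against the current docs, and note the usage policy: at most one request per second with a descriptive User-Agent:

```r
# build a Nominatim reverse-geocoding URL; format=json returns, among
# other fields, the feature's class/type and a bounding box
nominatim_reverse_url <- function(lat, lon) {
  paste0("https://nominatim.openstreetmap.org/reverse?format=json",
         "&lat=", lat, "&lon=", lon, "&zoom=18")
}

nominatim_reverse_url(31.5, 34.47)

# live call (illustrative, not run here):
# res <- httr::GET(nominatim_reverse_url(31.5, 34.47),
#                  httr::user_agent("airwars-scraping-project"))
# loc <- jsonlite::fromJSON(httr::content(res, as = "text"))
# loc$type         # e.g. "school", "mosque"
# loc$boundingbox  # the bounding-box coordinates we also store
```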
airwars_incidents |> select(target_type, contains("lat"), contains("long")) |> head() |> DT::datatable()
Sentiment Analysis
After attempting several text-classification models and some question/context models, we landed on j-hartmann/emotion-english-distilroberta-base because it goes beyond a positive/negative evaluation and instead analyzes text for Ekman’s six basic emotions (plus neutral), a framework common in psychological work on emotion. Moreover, this model affords us the ability to examine the emotional tone of these assessments over time.2
We get a score for each emotion; the closer a score is to one, the stronger the association, and the seven scores sum to 1.
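Because the scores sum to 1, a dominant emotion per incident can be read directly off the score columns. A sketch with made-up values, where the column names match the seven emotion fields in our incidents table:

```r
library(dplyr)

emo_cols <- c("anger", "disgust", "fear", "joy",
              "neutral", "sadness", "surprise")

# one toy incident's scores (values are made up)
toy <- tibble(anger = 0.05, disgust = 0.02, fear = 0.55, joy = 0.01,
              neutral = 0.10, sadness = 0.25, surprise = 0.02)

# check the scores sum to 1 and label the strongest emotion per row
scored <- toy |>
  mutate(total = rowSums(across(all_of(emo_cols))),
         dominant = emo_cols[max.col(as.matrix(across(all_of(emo_cols))))])
scored
```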
Given that we have over 800 assessments, we decided to use text3 because it allows us to use a laptop GPU (RTX 4070)4 to process these models for each incident, resulting in large processing gains.
Below we print an example of these scores, with the assessment text truncated.
airwars_incidents |>
slice_sample(n=1) |>
select(assessment:surprise) |>
mutate(assessment = str_trunc(assessment, 200),
across(where(is.double), ~ round(.x, 2))) |>
DT::datatable()
Footnotes
Note. Confidence is low to moderate since the data comes from the Hamas MoH.↩︎
The model is trained on a balanced subset from the datasets listed above (2,811 observations per emotion, i.e., nearly 20k observations in total). 80% of this balanced subset is used for training and 20% for evaluation. The evaluation accuracy is 66% (vs. the random-chance baseline of 1/7 = 14%).↩︎
An R-package for analyzing natural language with transformers from HuggingFace using Natural Language Processing and Machine Learning.↩︎
The installation for text is tricky, as the right Python libraries must be installed. To compile models on the GPU, we learned that the NVIDIA CUDA drivers must be installed for version 12.1. Additionally, we could only get this to work via Anaconda within Ubuntu 24.04 installed through WSL2 on Windows 11. Ubuntu 24.10 ships a kernel that forces CUDA 12.8 to be installed, which did not work for us in a dual-boot system.↩︎